Protein - Structure Mapping, Alignments, and Visualization¶

This notebook gives an example of how to map a single protein sequence to its structure, along with conducting sequence alignments and visualizing the mutations.

Input: Protein ID + amino acid sequence + mutated sequence(s)

Output: Representative protein structure, sequence alignments, and visualization of mutations

Imports¶

In [1]:

import sys
import logging

In [2]:

# Import the Protein class
from ssbio.core.protein import Protein

In [3]:

# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging¶

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don’t affect running of the pipeline
INFO (default)
- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff

Warning: DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!

In [4]:

# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [5]:

# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project¶

Set these three things:

ROOT_DIR
- The directory where a folder named after your PROTEIN_ID will be created
PROTEIN_ID
- Your protein ID
PROTEIN_SEQ
- Your protein sequence

A directory will be created in ROOT_DIR with your PROTEIN_ID name. The folders are organized like so:

ROOT_DIR
└── PROTEIN_ID
    ├── sequences  # Protein sequence files, alignments, etc.
    └── structures  # Protein structure files, calculations, etc.

In [6]:

# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

class ssbio.core.protein.Protein(ident, description=None, root_dir=None, pdb_file_type=’mmtf’)[source]

Store information about a protein, which represents the monomeric translated unit of a gene.

The main utilities of this class are to:

Load, parse, and store the same (ie. from different database sources) or similar (ie. from different strains) protein sequences as SeqProp objects in the sequences attribute
Load, parse, and store multiple experimental or predicted protein structures as StructProp objects in the structures attribute
Set a single representative_sequence and representative_structure
Calculate, store, and access pairwise sequence alignments to the representative sequence or structure
Provide summaries of alignments and mutations seen
Map between residue numbers of sequences and structures

Parameters:	ident (str) – Unique identifier for this protein description (str) – Optional description for this protein root_dir (str) – Path to where the folder named by this protein’s ID will be created. Default is current working directory. pdb_file_type (str) – `pdb`, `mmCif`, `xml`, `mmtf` - file type for files downloaded from the PDB

Todo

Implement structural alignment objects with FATCAT

In [7]:

# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR, pdb_file_type='mmtf')

In [8]:

# Load the protein sequence
# This sets the loaded sequence as the representative one
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)

Out[8]:

<SeqProp WT at 0x7f5f79cbd908>

Mapping sequence –> structure¶

Since the sequence has been provided, we just need to BLAST it to the PDB.

Note: These methods do not download any 3D structure files.

Methods¶

Protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0, evalue=0.0001, display_link=False, outdir=None, force_rerun=False)[source]

BLAST the representative protein sequence to the PDB. Saves a raw BLAST result file (XML file).

Parameters:	seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form) evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default). display_link (bool, optional) – Set to True if links to the HTML results should be displayed outdir (str) – Path to output directory of downloaded XML files, must be set if protein directory was not initialized force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False.
Returns:	List of new `PDBProp` objects added to the `structures` attribute
Return type:	list

In [9]:

# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()

Out[9]:

['2zyd', '2zya', '3fwn', '2zyg']

Out[9]:

	pdb_chain_id	hit_score	hit_evalue	hit_percent_similar	hit_percent_ident	hit_num_ident	hit_num_similar
pdb_id
2zya	A	2319.0	0.0	0.987179	0.963675	451	462
2zya	B	2319.0	0.0	0.987179	0.963675	451	462
2zyd	A	2319.0	0.0	0.987179	0.963675	451	462
2zyd	B	2319.0	0.0	0.987179	0.963675	451	462
2zyg	A	2284.0	0.0	0.982906	0.950855	445	460

Downloading and ranking structures¶

Methods¶

Protein.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)[source]

Download ALL mapped experimental structures to the protein structures directory.

Parameters:	outdir (str) – Path to output directory, if protein structures directory not set or other output directory is desired pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired force_rerun (bool) – If files should be re-downloaded if they already exist
Returns:	List of PDB IDs that were downloaded
Return type:	list

Todo

Parse mmtf or PDB file for header information, rather than always getting the cif file for header info

Warning: Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.

In [10]:

# Download all mapped PDBs and gather the metadata
my_protein.pdb_downloader_and_metadata()
my_protein.df_pdb_metadata.head(2)

Out[10]:

['2zyd', '2zya', '3fwn', '2zyg']

Out[10]:

	pdb_title	description	experimental_method	mapped_chains	resolution	chemicals	taxonomy_name	structure_file
pdb_id
2zya	Dimeric 6-phosphogluconate dehydrogenase compl...	6-phosphogluconate dehydrogenase, decarboxylat...	X-RAY DIFFRACTION	A;B	1.6	6PG	Escherichia coli	2zya.mmtf
2zyd	Dimeric 6-phosphogluconate dehydrogenase compl...	6-phosphogluconate dehydrogenase, decarboxylat...	X-RAY DIFFRACTION	A;B	1.5	GLO	Escherichia coli	2zyd.mmtf

Protein.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine=’needle’, always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, clean=True, keep_chemicals=None, skip_large_structures=False, force_rerun=False)[source]

Set a representative structure from a structure in the structures attribute.

Each gene can have a combination of the following, which will be analyzed to set a representative structure.

Homology model(s)
Ranked PDBs
BLASTed PDBs

If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.

Parameters:	seq_outdir (str) – Path to output directory of sequence alignment files, must be set if Protein directory was not created initially struct_outdir (str) – Path to output directory of structure files, must be set if Protein directory was not created initially pdb_file_type (str) – `pdb`, `mmCif`, `xml`, `mmtf` - file type for files downloaded from the PDB engine (str) – `biopython` or `needle` - which pairwise alignment program to use. `needle` is the standard EMBOSS tool to run pairwise alignments. `biopython` is Biopython’s implementation of needle. Results can differ! always_use_homology (bool) – If homology models should always be set as the representative structure rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure) seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications. allow_mutants (bool) – If mutations should be allowed or checked for allow_deletions (bool) – If deletions should be allowed or checked for allow_insertions (bool) – If insertions should be allowed or checked for allow_unresolved (bool) – If unresolved residues should be allowed or checked for clean (bool) – If structure should be cleaned keep_chemicals (str, list) – Keep specified chemical names if structure is to be cleaned skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true. force_rerun (bool) – If sequence to structure alignment should be rerun
Returns:	Representative structure from the list of structures. This is a not a map to the original structure, it is copied and optionally cleaned from the original one.
Return type:	StructProp

Todo

Remedy large structure representative setting

In [11]:

# Set representative structures
my_protein.set_representative_structure()

Out[11]:

<StructProp REP-2zyd at 0x7f5f79bea1d0>

Loading and aligning new sequences¶

You can load additional sequences into this protein object and align them to the representative sequence.

In [12]:

my_protein.__dict__

Out[12]:

{'_root_dir': '/tmp',
 'description': None,
 'id': 'SRR1753782_00918',
 'notes': {},
 'pdb_file_type': 'mmtf',
 'representative_chain': 'A',
 'representative_chain_seq_coverage': 95.7,
 'representative_sequence': <SeqProp WT at 0x7f5f79cbd908>,
 'representative_structure': <StructProp REP-2zyd at 0x7f5f79bea1d0>,
 'sequence_alignments': [<<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7f5f79be7a90>,
  <<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7f5f79299080>],
 'sequences': [<SeqProp WT at 0x7f5f79cbd908>],
 'structure_alignments': [],
 'structures': [<PDBProp 2zyd at 0x7f5fdb48e080>,
  <PDBProp 2zya at 0x7f5fd538a550>,
  <PDBProp 3fwn at 0x7f5f79cbde48>,
  <PDBProp 2zyg at 0x7f5f79cbdda0>,
  <StructProp REP-2zyd at 0x7f5f79bea1d0>]}

Methods¶

Protein.load_manual_sequence(seq, ident=None, write_fasta_file=False, outdir=None, set_as_representative=False, force_rewrite=False)[source]

Load a manual sequence given as a string and optionally set it as the representative sequence. Also store it in the sequences attribute.

Parameters:	seq (str, Seq, SeqRecord) – Sequence string, Biopython Seq or SeqRecord object ident (str) – Optional identifier for the sequence, required if seq is a string. Also will override existing IDs in Seq or SeqRecord objects if set. write_fasta_file (bool) – If this sequence should be written out to a FASTA file outdir (str) – Path to output directory set_as_representative (bool) – If this sequence should be set as the representative one force_rewrite (bool) – If the FASTA file should be overwritten if it already exists
Returns:	Sequence that was loaded into the `sequences` attribute
Return type:	SeqProp

In [ ]:

# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)

In [ ]:

# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)

Protein.pairwise_align_sequences_to_representative(gapopen=10, gapextend=0.5, outdir=None, engine=’needle’, parse=True, force_rerun=False)[source]

Pairwise all sequences in the sequences attribute to the representative sequence. Stores the alignments in the sequence_alignments DictList attribute.

Parameters:

gapopen (int) – Only for engine='needle' - Gap open penalty is the score taken away when a gap is created
gapextend (float) – Only for engine='needle' - Gap extension penalty is added to the standard gap penalty for each base or residue in the gap
outdir (str) – Only for engine='needle' - Path to output directory. Default is the protein sequence directory.
engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
parse (bool) – Store locations of mutations, insertions, and deletions in the alignment object (as an annotation)
force_rerun (bool) – Only for engine='needle' - Default False, set to True if you want to rerun the alignment if outfile exists.

In [ ]:

# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()

In [ ]:

# View IDs of all sequence alignments
[x.id for x in my_protein.sequence_alignments]

# View the stored information for one of the alignments
my_alignment = my_protein.sequence_alignments.get_by_id('SRR1753782_00918_N17P_SNP')
my_alignment.annotations
str(my_alignment[0].seq)
str(my_alignment[1].seq)

Protein.sequence_mutation_summary(alignment_ids=None, alignment_type=None)[source]

Summarize all mutations found in the sequence_alignments attribute.

Returns 2 dictionaries, single_counter and fingerprint_counter.

single_counter:

Dictionary of {point mutation: list of genes/strains} Example:

{
    ('A', 24, 'V'): ['Strain1', 'Strain2', 'Strain4'],
    ('R', 33, 'T'): ['Strain2']
}

Here, we report which genes/strains have the single point mutation.

fingerprint_counter:

Dictionary of {mutation group: list of genes/strains} Example:

{
    (('A', 24, 'V'), ('R', 33, 'T')): ['Strain2'],
    (('A', 24, 'V')): ['Strain1', 'Strain4']
}

Here, we report which genes/strains have the specific combinations (or “fingerprints”) of point mutations

Parameters:	alignment_ids (str, list) – Specified alignment ID or IDs to use alignment_type (str) – Specified alignment type contained in the `annotation` field of an alignment object, `seqalign` or `structalign` are the current types.
Returns:	single_counter, fingerprint_counter
Return type:	dict, dict

In [ ]:

# Summarize all the mutations in all sequence alignments
s,f = my_protein.sequence_mutation_summary(alignment_type='seqalign')
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f

Some additional methods¶

Getting binding site/other information from UniProt¶

In [ ]:

import ssbio.databases.uniprot

In [ ]:

this_examples_uniprot = 'P14062'
sites = ssbio.databases.uniprot.uniprot_sites(this_examples_uniprot)
my_protein.representative_sequence.features = sites
my_protein.representative_sequence.features

Mapping sequence residue numbers to structure residue numbers¶

Methods¶

Protein.map_seqprop_resnums_to_structprop_resnums(resnums, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Map a residue number in any SeqProp to the structure’s residue number for a specified chain.

Parameters:	resnums (int, list) – Residue numbers in the sequence seqprop (SeqProp) – SeqProp object structprop (StructProp) – StructProp object chain_id (str) – Chain ID to map to use_representatives (bool) – If the representative sequence and structure should be used. If True, seqprop, structprop, and chain_id do not need to be defined.
Returns:	Mapping of sequence residue numbers to structure residue numbers
Return type:	dict

In [ ]:

# Returns a dictionary mapping sequence residue numbers to structure residue identifiers
# Will warn you if residues are not present in the structure
structure_sites = my_protein.map_seqprop_resnums_to_structprop_resnums(resnums=[1,3,45],
                                                                       use_representatives=True)
structure_sites

Viewing structures¶

The awesome package nglview is utilized as a backend for viewing structures within a Jupyter notebook. ssbio view functions will either return a NGLWidget object, which is the same as using nglview like the below example, or act upon the widget object itself.

# This is how NGLview usually works - it will load a structure file and return a NGLWidget "view" object.
import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view

Methods¶

StructProp.view_structure(only_chains=None, opacity=1.0, recolor=False, gui=False)[source]

Use NGLviewer to display a structure in a Jupyter notebook

Parameters:	only_chains (str, list) – Chain ID or IDs to display opacity (float) – Opacity of the structure recolor (bool) – If structure should be cleaned and recolored to silver gui (bool) – If the NGLview GUI should show up
Returns:	NGLviewer object

In [ ]:

# View just the structure
view = my_protein.representative_structure.view_structure()
view

Protein.add_mutations_to_nglview(view, alignment_type=’seqalign’, alignment_ids=None, seqprop=None, structprop=None, chain_id=None, use_representatives=False, grouped=False, color=’red’, unique_colors=True, opacity_range=(0.8, 1), scale_range=(1, 5))[source]

Add representations to an NGLWidget view object for residues that are mutated in the sequence_alignments attribute.

Parameters:

view (NGLWidget) – NGLWidget view object
alignment_type (str) – Specified alignment type contained in the annotation field of an alignment object, seqalign or structalign are the current types.
alignment_ids (str, list) – Specified alignment ID or IDs to use
seqprop (SeqProp) – SeqProp object
structprop (StructProp) – StructProp object
chain_id (str) – ID of the structure’s chain to get annotation from
use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
grouped (bool) – If groups of mutations should be colored and sized together
color (str) – Color of the mutations (overridden if unique_colors=True)
unique_colors (bool) – If each mutation/mutation group should be colored uniquely
opacity_range (tuple) – Min/max opacity values (mutations that show up more will be opaque)
scale_range (tuple) – Min/max size values (mutations that show up more will be bigger)

In [ ]:

# Map the mutations on the visualization (scale increased) - will show up on the above view
my_protein.add_mutations_to_nglview(view=view, alignment_type='seqalign', scale_range=(4,7),
                                    use_representatives=True)

Protein.add_features_to_nglview(view, seqprop=None, structprop=None, chain_id=None, use_representatives=False)[source]

Add select features from the selected SeqProp object to an NGLWidget view object.

Currently parsing for:

Single residue features (ie. metal binding sites)

Disulfide bonds

Parameters:	view (NGLWidget) – NGLWidget view object seqprop (SeqProp) – SeqProp object structprop (StructProp) – StructProp object chain_id (str) – ID of the structure’s chain to get annotation from use_representatives (bool) – If the representative sequence/structure/chain IDs should be used

In [ ]:

# Add sites as shown above in the table to the view
my_protein.add_features_to_nglview(view=view, use_representatives=True)

Saving¶

Protein.save_json(outfile, compression=False): Save the object as a JSON file using json_tricks

In [ ]:

import os.path as op
my_protein.save_json(op.join(my_protein.protein_dir, '{}.json'.format(my_protein.id)))